17 research outputs found

    Why is German dependency parsing more reliable than constituent parsing?

    In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, as well as for Arabic, Chinese, and Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotations used. Another direction in parsing research is the development of dependency parsers. Dependency parsing profits from the non-hierarchical nature of dependency relations; thus, lexical information can be included in the parsing process in a much more natural way. Machine-learning-based approaches are especially successful (cf. e.g.). The results achieved by these dependency parsers are very competitive, although comparisons are difficult because of the differences in annotation. For English, the Penn Treebank has been converted to dependencies. For this version, Nivre et al. report an accuracy rate of 86.3%, as compared to an F-score of 92.1 for Charniak's parser. The Penn Chinese Treebank is also available in both a constituent and a dependency representation. The best results reported for parsing experiments with this treebank give an F-score of 81.8 for the constituent version and 79.8% accuracy for the dependency version. The general trend in comparisons between constituent and dependency parsers is that the dependency parser performs slightly worse than the constituent parser. The only exception occurs for German, where F-scores for constituent plus grammatical function parses range between 51.4 and 75.3, depending on the treebank, NEGRA or TüBa-D/Z. The dependency parser based on a converted version of TüBa-D/Z, in contrast, reached an accuracy of 83.4%, i.e. 12 percentage points better than the best constituent analysis including grammatical functions.

    Families and resemblances



    Applying the Levenshtein Distance to Catalan dialects: A brief comparison of two dialectometric approaches

    Abstract. In recent years, dialectometry has gained interest among Catalan dialectologists. As a consequence, a specific dialectometric approach has been developed at the University of Barcelona, which aims to increase the accuracy of the final groupings by separating the predictable components of the language from the unpredictable ones. Another popular method for obtaining dialect distances is the Levenshtein Distance (LD), which has not previously been applied to a Catalan corpus. The goal of this paper is to present the results of applying the LD to a corpus of Catalan linguistic data, and to compare the results of this analysis both with the results from Barcelona and with the traditional classifications of Catalan dialectology.
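
As a rough illustration of how the Levenshtein Distance can yield a distance between two dialect sites, the following minimal sketch computes unit-cost edit distances between paired transcriptions and averages them. It is not the procedure used in the paper: the function names, the normalisation by the longer form, and the toy transcriptions are all assumptions made for this example.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (0 if segments match)
            ))
        prev = curr
    return prev[-1]


def site_distance(site_a, site_b):
    """Mean length-normalised Levenshtein distance over shared items.

    site_a and site_b map an item label to a transcription. Dividing by
    the longer form is one common normalisation; it is an assumption here.
    """
    shared = sorted(set(site_a) & set(site_b))
    if not shared:
        raise ValueError("the two sites share no items")
    total = 0.0
    for item in shared:
        a, b = site_a[item], site_b[item]
        total += levenshtein(a, b) / max(len(a), len(b), 1)
    return total / len(shared)


# Invented toy transcriptions, not taken from any Catalan corpus.
site_x = {"w1": "kasa", "w2": "pato"}
site_y = {"w1": "kaza", "w2": "pat"}
print(site_distance(site_x, site_y))
```

Computing `site_distance` for every pair of sites gives a distance matrix, which can then be passed to a clustering or scaling method of one's choice.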

    DH Benelux Journal 4. The Humanities in a Digital World

    The fourth volume of the DH Benelux Journal. This volume includes seven full-length, peer-reviewed articles that are based on accepted contributions to the 2021 virtual DH Benelux conference. Contents:
    1. Editors' Preface (Wout Dillen, Margherita Fantoli, Marijn Koolen, Marieke van Erp)
    2. Introduction: The Humanities in a Digital World (Lorella Viola, Jelena Prokic, Antske Fokkens, Tommaso Caselli)
    3. A Game of Persistence, Self-doubt, and Curiosity: Surveying Code Literacy in Digital Humanities (Elli Bleeker, Marijn Koolen, Kaspar Beelen, Liliana Melgar, Joris van Zundert, Sally Chambers)
    4. Introducing the DHARPA Project: An Interdisciplinary Lab to Enable Critical DH Practice (Angela R. Cunningham, Helena Jaskov, Sean Takats, Lorella Viola)
    5. Examining a Multi Layered Approach for Classification of OCR Quality without Ground Truth (Mirjam Cuper)
    6. Modeling Ontologies for Individual Artists: A Case Study of a Dutch Ceramic Glass Sculptor (Victor de Boer, Daan Raven, Erik Esmeijer, Johan Oome)
    7. Judging a Book by its Criticism: A Digital Analysis of the Professional and Community Driven Literary Criticism of the Ingeborg-Bachmann-Preis (Lore De Greve, Gunther Martens)
    8. When No News is Bad News. News-Based Change Detection during COVID-19 (Kristoffer L. Nielbo, Frida Hæstrup, Kenneth C. Enevoldsen, Peter B. Vahlstrup, Rebekah B. Baglini, Andreas Roepstorff)
    9. Combining Tools with Linked Data: a Social History Example (Ivo Zandhuis)

    Families and resemblances

    Dialectometry is a multidisciplinary field that uses quantitative methods in the analysis of dialect data. From the very beginning, most of the research in dialectometry has been focused on including large amounts of data in analyses and offering alternative views to researchers. Later it was used for the identification of dialect groups and development of methods that would tell us how similar (or different) one variety is when compared to the neighboring varieties. In this book we present advances in several techniques that allow the researcher to automatically measure the differences between language varieties. We test all methods on Bulgarian dialect pronunciation data. Part of the research presented relies on the Levenshtein algorithm to aggregate over the numerous features found in the data and infer the similarities/distances among the groups of dialects. We investigate the application of clustering techniques in the detection of dialect groups, and propose several evaluation techniques that can be used to estimate the quality of the automatically obtained groups. In order to automatically infer the distances between the phones in the data set we combine the Levenshtein algorithm with the technique called pointwise mutual information. Information on the distances between the phones helps us get better estimates on the distances between the strings, and consequently on the distances between language varieties. In this thesis we also test an alternative approach to dialect variation that is more historically motivated. We employ a method taken from phylogenetics, namely Bayesian inference of phylogeny, which focuses on systematic shared innovations as a signal of common ancestry, and reexamine the relatedness among the Bulgarian dialect varieties. This method is applied to the automatically multiply aligned strings, which we produce and evaluate using two novel methods. 
    The results of applying different quantitative techniques to the Bulgarian dialect data show that some of the traditional divisions of this area have to be questioned if only pronunciation data is taken into account. The comparison of the divisions resulting from the geographic and historical approaches shows that these two perspectives give a very similar picture of Bulgarian dialect variation. None of the methods developed is language-specific, nor are they applicable only to dialect data.
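
The abstract above mentions combining the Levenshtein algorithm with pointwise mutual information (PMI) to infer distances between phones. The sketch below shows only the PMI step, computed from a list of already aligned segment pairs. It is a simplified illustration under stated assumptions, not the thesis's actual procedure: the function name `pmi_table`, the way probabilities are estimated, and the toy alignment data are all hypothetical.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """PMI for each unordered segment pair seen in the alignments.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where p(x, y) is the
    relative frequency of the aligned pair and p(x), p(y) are the relative
    frequencies of the segments among all aligned segments. Higher values
    mean the two segments are aligned more often than chance predicts.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    seg_counts = Counter(seg for pair in aligned_pairs for seg in pair)
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())

    table = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = seg_counts[x] / n_segs
        p_y = seg_counts[y] / n_segs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Invented aligned segment pairs, standing in for the output of an aligner.
pairs = [("a", "a"), ("a", "a"), ("a", "e"), ("e", "e")]
for pair, value in sorted(pmi_table(pairs).items()):
    print(pair, round(value, 3))
```

One plausible way to use such scores, again as an assumption, is to convert them into substitution costs (a higher PMI giving a lower cost) and rerun the alignment with those costs in place of unit costs.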

    Dialectology for computational linguists
